Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracing function runtime patch #1060

Open
wants to merge 33 commits into
base: master
Choose a base branch
from

Conversation

Pavel-Durov
Copy link
Contributor

  • Move tracing function from ykcapi to ykrt.
  • Add runtime patching with a single ret instruction to the tracing function when jit transition between tracing and stop-tracing (including side traces).

ykrt/src/trace/swt/patch.rs Outdated Show resolved Hide resolved
ykrt/src/trace/swt/patch.rs Outdated Show resolved Hide resolved
let result = mprotect(
func_address,
page_size_aligned,
PROT_READ | PROT_WRITE | PROT_EXEC,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need the PROT_EXEC here since you are changing it back to executable again below. In fact, on some systems you are not allowed to mark a region writeable and executable at the same time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.
But when I remove PROT_EXEC I get some of the tests failing in a non-deterministic way 🤔
I guess its because somehow test execution overlaps and this runtime patching takes effect while other tests try to execute these instructions.
I get the same result when I run the tests sequentially as:

YKB_TRACER=swt cargo test -- --test-threads=1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, on some systems you are not allowed to mark a region writeable and executable at the same time.

I think Lukas's point is that you can mprotect(PROT_READ | PROT_WRITE) or mprotect(PROT_READ | PROT_EXEC) but you can't combine PROT_EXEC and PROT_WRITE: you have to write the page and only then mark it as executable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated 👉 40c0696
I get that :)
My issue was with failing the test, but I can't reproduce it anymore 😶

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that I can't combine PROT_WRITE and PROT_EXEC but..

I again get this issue with failing tests if I set PROT_READ | PROT_WRITE before callingcopy_nonoverlapping (patching the function):

Exited due to signal: 11

But when I set it as PROT_READ | PROT_WRITE | PROT_EXEC it works fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The normal trick here is: set to PROT_READ|PROT_WRITE do the writes; then set it to PROT_READ | PROT_EXEC. That way you never have PROT_WRITE and PROT_EXEC set at once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand and I wish it would work for me but I get segfault when I do it.
Will dig deeper.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm going mad 🙃
Its working fine with PROT_READ | PROT_WRITE on patch and PROT_READ | PROT_EXEC restore.
updated 👉 6f4bdc3

@Pavel-Durov
Copy link
Contributor Author

I still need to perform benchmarking to compare hwt / swt / swt + noop / swt + runtime patch runtimes.
I would hold off on the PR review until it's done as it might introduce more changes.

@Pavel-Durov
Copy link
Contributor Author

Pavel-Durov commented Mar 30, 2024

Benchmarks completed.

Benchmarks - JIT (threshold=5)

YK Range (min … max) Time (mean ± σ) Runs
swt + runtime-patch 3.140 s … 5.266 s 3.895 s ± 0.654 s 50
swt 3.221 s … 5.163 s 4.035 s ± 0.638 s 50
swt + nop 288.6 ms … 520.1 ms 420.4 ms ± 92.6 ms 50
hwt 3.444 s … 5.503 s 4.453 s ± 0.775 s 50

Benchmarks - No JIT (threshold=9999999)

YK Range (min … max) Time (mean ± σ) Runs
swt + runtime-patch 248.4 ms … 257.6 ms 252.8 ms ± 2.1 ms 50
swt + nop 44.3 ms … 57.0 ms 54.3 ms ± 3.0 ms 50
swt 244.9 ms … 268.4 ms 259.2 ms ± 4.5 ms 50
hwt 35.8 ms … 44.5 ms 39.4 ms ± 1.0 ms 50

Summary

Looks like the runtime patching doesn't give the same performance gain as we've seen with nop benchmarks (x4 boost), there's probably another overhead involved.
Or maybe we need to find the right threshold value for yk_mt_hot_threshold_set to see the impact?

Command used:

YKD_SERIALISE_COMPILATION=1 hyperfine -w 2 -m 50 './src/lua ./tests/closure.lua'

@ltratt
Copy link
Contributor

ltratt commented Mar 30, 2024

Let's try this on a few other benchmarks, including those that make longer traces (with short traces we wouldn't necessarily expect to see much difference).

@Pavel-Durov
Copy link
Contributor Author

Disabling trace compilation:

diff --git ykrt/src/mt.rs ykrt/src/mt.rs
index 8e9f77c4..c02b6420 100644
--- ykrt/src/mt.rs
+++ ykrt/src/mt.rs
@@ -630,11 +630,11 @@ impl MT {
             // spin up a new thread for each compilation. This is only acceptable because a)
             // `SERIALISE_COMPILATION` is an internal yk testing feature b) when we use it we're
             // checking correctness, not performance.
-            thread::spawn(do_compile).join().unwrap();
+            // thread::spawn(do_compile).join().unwrap();
             return;
         }
 
-        self.queue_job(Box::new(do_compile));
+        // self.queue_job(Box::new(do_compile));
     }
 

Does show the difference as before - roughly x4:
yk+swt master

Benchmark 1: ./src/lua ./stats/matrix/matrix.lua 50
  Time (mean ± σ):     330.8 ms ±   7.3 ms    [User: 424.2 ms, System: 36.4 ms]
  Range (min … max):   321.6 ms … 337.4 ms    10 runs

yk-swt/tracing-function-runtime-patch:

Benchmark 1: ./src/lua ./stats/matrix/matrix.lua 50
  Time (mean ± σ):      1.444 s ±  0.344 s    [User: 1.540 s, System: 0.031 s]
  Range (min … max):    1.102 s …  1.784 s    10 runs

ykrt/src/trace/swt/patch.rs Outdated Show resolved Hide resolved
let page_size = sysconf(libc::_SC_PAGESIZE) as usize;

let func_address = ((function_ptr as usize) & !(page_size - 1)) as *mut c_void;
let page_size_aligned = (((function_ptr as usize) + mem::size_of_val(&function_ptr))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use std::alloc::Layout::align and friends to do this calculation for us.

I'm also not sure what the difference between function_ptr and func_address is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not sure what the difference between function_ptr and func_address is.

My understanding/intention here is that function_ptr is just the pointer to the function while func_address is the page address (maybe it should be renamed)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated 👉 8e00241

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might get lost but there are some comments on 8e00241.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I did missed that! sorry

// This unwrap should be safe since we are using a page that is
// based on function_ptr with a known location.
let layout = Layout::from_size_align(start_offset, page_size)
.expect("Failed to create layout for function memory page");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just use unwrap as all these expects really bloat the code, especially as they shouldn't ever trigger. [As this suggests we almost never use expect: it does have a place, but it's really rare for us.]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.
updated 👉 0428dc1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get rid of the other expects introduced in this PR too please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any other expects only a single unwrap.
Am I missing it somehow?

}
// Set function memory page as writable.
// Ignoring mprotect call failure.
mprotect(page_address, layout.size(), PROT_READ | PROT_WRITE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to ignore the return code: we just want to abort changing the machine code. So we need something like if mprotect(...) { return ... } and then recover from there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to check: is just return enough for the system to keep running? If so, let's add a comment to that effect.

Alternatively, if we think "if mprotect has failed, we've got something wrong", then we might want to panic if mprotect fails.

Copy link
Contributor Author

@Pavel-Durov Pavel-Durov Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok to return if mrotect as it should be transactional and not leave the memory in corrupted state on error. Unless we consider the "runtime patching" as a must-have for SWT, then we should definitely panic.

I tested it with different invalid memory addresses that resulted in mrotect returning a non-0 result and the test suite passed.
We still risk changing memory that will not necessarily cause mrotect to return with an error but will cause some other part of the process to have invalid memory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think this runtime patching is an integral part of SWT.
Added panic 👉 6f0abb3

@ltratt
Copy link
Contributor

ltratt commented Apr 25, 2024

Hmm, I've just realised a possible blocker with this: multi-threading. That's going to require some very careful handling.

@Pavel-Durov
Copy link
Contributor Author

Hmm, I've just realised a possible blocker with this: multi-threading. That's going to require some very careful handling.

Yes, that would be a problem with instruction patching but I thought we only support serialised compilation in YK?

@ltratt
Copy link
Contributor

ltratt commented Apr 25, 2024

yk is definitely multi-threaded but we tend to temporarily turn that off because of problems with llvm. When the new JIT codegen is ready, we'll be back to fully multi-threaded.

Let's have a chat about this PR in a day or two. It's still very useful!

@ltratt
Copy link
Contributor

ltratt commented Dec 4, 2024

@Pavel-Durov I think this PR is probably never going to be resurrected, at least not in the sense of "merged into master". If you agree, I suggest we close this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants